Part A 1.multiple regression analysis
# Load the package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
library(GGally)
## Warning: package 'GGally' was built under R version 3.4.4
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.4
library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
# Read the file
df<-read.csv("SalesPerformance.csv", header = TRUE, stringsAsFactors = FALSE)
# Regression use all the variables
RM1 <- lm(Profit ~ Area + Popn + Outlets + Commis, data = df)
summary(RM1)
##
## Call:
## lm(formula = Profit ~ Area + Popn + Outlets + Commis, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773.91 -93.91 25.22 108.02 505.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 694.6890 367.1055 1.892 0.06475 .
## Area -22.9564 8.2917 -2.769 0.00809 **
## Popn 101.7687 61.4965 1.655 0.10476
## Outlets 0.8301 1.5524 0.535 0.59540
## Commis 316.7449 68.4808 4.625 3.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 220.6 on 46 degrees of freedom
## Multiple R-squared: 0.6518, Adjusted R-squared: 0.6215
## F-statistic: 21.52 on 4 and 46 DF, p-value: 4.648e-10
The model above contains all four potential variables. The dependent variable is profit. The summary chart indicates that the regression model has a adjusted r-squared of 0.6215, which is fair enough. However, the p values also show that some variables such as Popn and Commis are not very significant in the model. Commis has the most effect on PROFIT.
# Plot the regression
plot(RM1)
The plot 1 shows that the residuals are not that random distributed.
The plot 2 shows that the some standardized residuals do not fit the line well.
The plot 3 and plot 4 show that outliers such as ‘19’, ‘32’, ‘47’ exist.
RM2 <- lm(Profit ~ Area + Popn + Outlets, data = df)
summary(RM2)
##
## Call:
## lm(formula = Profit ~ Area + Popn + Outlets, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -640.4 -203.6 20.0 175.7 456.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1396.2926 400.3114 3.488 0.00107 **
## Area -30.0000 9.7601 -3.074 0.00351 **
## Popn 52.1988 72.5125 0.720 0.47518
## Outlets -0.4619 1.8286 -0.253 0.80169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.2 on 47 degrees of freedom
## Multiple R-squared: 0.4898, Adjusted R-squared: 0.4572
## F-statistic: 15.04 on 3 and 47 DF, p-value: 5.381e-07
plot(RM2)
(2)Then, remove the variable OUTLETS
RM3 <- lm(Profit ~ Area + Popn + Commis, data = df)
summary(RM3)
##
## Call:
## lm(formula = Profit ~ Area + Popn + Commis, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -751.69 -90.93 21.43 101.61 499.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 792.701 315.650 2.511 0.01552 *
## Area -23.301 8.204 -2.840 0.00664 **
## Popn 116.444 54.613 2.132 0.03825 *
## Commis 310.156 66.849 4.640 2.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 218.9 on 47 degrees of freedom
## Multiple R-squared: 0.6496, Adjusted R-squared: 0.6272
## F-statistic: 29.04 on 3 and 47 DF, p-value: 8.984e-11
plot(RM3)
RM4 <- lm(Profit ~ Outlets + Commis, data = df)
summary(RM4)
##
## Call:
## lm(formula = Profit ~ Outlets + Commis, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -815.56 -169.86 13.72 154.83 622.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -298.425 267.366 -1.116 0.269907
## Outlets 6.962 1.392 5.001 8.04e-06 ***
## Commis 329.694 87.125 3.784 0.000428 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.1 on 48 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3634
## F-statistic: 15.27 on 2 and 48 DF, p-value: 7.359e-06
plot(RM4)
The combination effect of COMMIS and OUTLETS can be evaluated by the the summary tables and plots from all the regression models above. First, the R square is the highest in RM3(0.6272 ), that is the the model that removes OUTLETS, it is even higher than RM1(0.6215).
The results indicate that the proportion of the variance in the dependent variable that is predictable from the independent variable(s) is decrease with the involvement of OUTLETS. However, the differences between the R squares is not that significant, so we can not conclude the variables have a negative effect to each other.
From the plots of RM1, RM2 and RM3, it is clear that with the involvement of both COMMIS and OUTLETS, the models fit better to the dataset. Moreover, in RM1, the estimate of COMMIS and OUTLETS are 0.8301 and 316.7449, the absolute value of which is higher than the absolute in RM2 and RM3, -0.4619 and 310.156. Also, from RM4 with just COMMIS and OUTLETS as variables, p value is the most significant and the residuals are randomly distributed. So we can conclude that the two variables do have combination effect.
# Add a new column
df$Profit_of_outlets<-(df$Profit)/(df$Outlets)
RM5 <- lm(Profit_of_outlets ~ Popn + Commis + Area, data = df)
summary(RM5)
##
## Call:
## lm(formula = Profit_of_outlets ~ Popn + Commis + Area, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9040 -0.6038 0.0779 0.8308 2.4466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0652 1.8471 3.825 0.000384 ***
## Popn 0.0258 0.3196 0.081 0.935995
## Commis 2.0420 0.3912 5.220 3.98e-06 ***
## Area -0.1536 0.0480 -3.200 0.002462 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.281 on 47 degrees of freedom
## Multiple R-squared: 0.5527, Adjusted R-squared: 0.5242
## F-statistic: 19.36 on 3 and 47 DF, p-value: 2.58e-08
summary(RM1)
##
## Call:
## lm(formula = Profit ~ Area + Popn + Outlets + Commis, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773.91 -93.91 25.22 108.02 505.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 694.6890 367.1055 1.892 0.06475 .
## Area -22.9564 8.2917 -2.769 0.00809 **
## Popn 101.7687 61.4965 1.655 0.10476
## Outlets 0.8301 1.5524 0.535 0.59540
## Commis 316.7449 68.4808 4.625 3.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 220.6 on 46 degrees of freedom
## Multiple R-squared: 0.6518, Adjusted R-squared: 0.6215
## F-statistic: 21.52 on 4 and 46 DF, p-value: 4.648e-10
Comparing the summary tables of RM1 and RM5, I find that although the adjusted R-squared decreases, the Pr(>|t|) of all the variables is decrease, suggesting higher level of significance.
plot(RM5)
plot(RM1)
The plot 1 shows that the residuals are more random distributed for RM5. The plot 2 shows that the some standardized residuals do not fit the line better for RM5.
The analysis above concludes that use PROFIT divided by OUTLETS works better as a dependent variable than PROFIT itself.
Part B (25 marks): Omit variables that show little predictive value. Recall the models above, the three varibles COMMIS, POPN and AREA. Multicollinearity should be examined.
vif(RM5)
## Popn Commis Area
## 3.674139 1.111555 3.470757
X <- subset(df,select=-c(Profit_of_outlets,Profit,Outlets,Dist))
ggpairs(X)
The chart above shows that variables POPN and AREA have multicollinearity. Moreover, the the Pr(>|t|) of POPN shows extremely low level of significance. However, if omit POPN, the model only has 3 independent variables. The decision should be made after the summary and plots are analyzed.
Try a new model with PROFIT divided by OUTLETS as a dependent variable and omit POPN.
RM6 <- lm(Profit_of_outlets ~ Area + Commis, data = df)
summary(RM6)
##
## Call:
## lm(formula = Profit_of_outlets ~ Area + Commis, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9112 -0.5961 0.0748 0.8306 2.4742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.21035 0.42121 17.118 < 2e-16 ***
## Area -0.15688 0.02574 -6.096 1.79e-07 ***
## Commis 2.03285 0.37057 5.486 1.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.268 on 48 degrees of freedom
## Multiple R-squared: 0.5527, Adjusted R-squared: 0.534
## F-statistic: 29.65 on 2 and 48 DF, p-value: 4.124e-09
plot(RM6)
Comparing RM6 and RM5, we conclude POPN should be omitted.
Part C (25 marks): Locate and, if possible, correct any serious violations of assumptions.
Recall the plots of RM5. Outliers exist: 19,32,47. Of which 19 has the most influence on the model.
plot(RM5)
Thus, we remove 19 and stored it into a new dataframe.
df_regression <- df[-(19), ]
df_regression <- subset(df_regression, select=c(Dist,Area,Commis, Profit_of_outlets))
Lastly, we show the revised model
Model_regression <- lm(Profit_of_outlets ~ Area + Commis, data = df_regression)
summary(Model_regression)
##
## Call:
## lm(formula = Profit_of_outlets ~ Area + Commis, data = df_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.00856 -0.57344 -0.05051 0.66923 2.41762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.36532 0.35043 21.018 < 2e-16 ***
## Area -0.16997 0.02149 -7.908 3.50e-10 ***
## Commis 2.21934 0.30944 7.172 4.48e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.05 on 47 degrees of freedom
## Multiple R-squared: 0.6787, Adjusted R-squared: 0.665
## F-statistic: 49.64 on 2 and 47 DF, p-value: 2.584e-12
Part D (20 marks): Once you have completed Parts A, B, and C, please write a brief report (not to exceed 500 words) to the marketing managers and explain your findings; the managers are not familiar with the technical language of statistics, although they do have an idea what a standard deviation is.
Dear marketing managers,
In order to better evaluate the field sales representatives’ performance and how the variables suspected effect the performance, regression models are built and analyzed. The findings and suggestions are listed as follows.
In the original model(RM1), Straightforwardly, ‘Profit’ is used as the dependent variable. ‘Popn’, ‘Area’, ‘Commis’ and ‘Outlets’ should all be involved. The summary table shows that ‘Outlets’ is the most significant while ‘Popn’ and ‘Outlets’ contributed very little to ‘Profit’. The results indicate that sales representatives motivated in a great deal by compensation. With all other variables being equal, a full-commission representative can generate 316.7 more profit than a partially salaried representatives. However, with an increase ‘area’, the ‘Profit’ only decrease 22.96. Moreover, Pr(>|t|) show low level of significant of ‘Area’ and ‘Outlets’. The model indicates that 62.15% of the variance in the dependent variable that is predictable from the independent variable. Which is fair enough.
To analyze the combination effect of ‘Commis’ and ‘Outlets’, we find that with the involvement of both ‘Commis’ and ‘Outlets’, the models fit better to the dataset. Moreover, in RM1, the estimate of ‘Commis’ and ‘Outlets’ are 0.8301 and 316.7449, the absolute value of which is higher than the absolute in RM2 and RM3, -0.4619 and 310.156. So we can indicate that the two variables do have combination effect.
As mentioned in the question, ‘Profit’ may be inflated for representatives with many outlets. Changing the dependent variable into profit per outlets, we did another regression. The plots and parameters indicate that it is more reasonable to evaluate the performance of sales representatives using profit per outlets, that is divide ‘Profit’ by ‘Outlets’.
After the dependent variable is determined, we need to examine all the independent variables and remove those are not significant. Note that we only have three variable now, ‘Area’, ‘Popn’ and ‘Commis’. First, examine wether the variables are dependent on each other. In other words, if one variable is highly correlated with another, one of them should be removed. In our case ‘Popn’ and ‘Area’ is highly correlated. Moreover, the analysis above shows that ‘Popn’ has a low level of significance. Hence, ‘Popn’ should bd removed from the model.
Lastly, the dataset contains outliers. Outliers refer to the usual or extreme rows that will effect the model in a great deal. Dist = 19 in our case is the outlier that should be removed. Then we can get our model with profit per outlets as dependent variable and ‘Area’ and ‘Commis’ as independent variable.
To sum up, it is more reasonable to use profit per outlet to evaluate the sales representatives’ performance. ‘Commis’ is the most critical element while ‘Area’ also should be considered. The representation should be offered full-commission to be motivated. On the contrary, ‘Popn’ should not be considered as it is not significant and is highly correlated to ‘Area’. Unusual or extreme situation should be omitted to obtain a fair evaluation.